Chapter 3 Reproducibility
3.1 Reproducing data from a published paper
Here I show how I reproduced results from a published paper.
The data used in this assignment come from (van der Voet et al. 2021).
library(tidyverse)
library(here)
library(readxl)
library(rbbt)
library(RColorBrewer)
offspring <- read_excel(here("data/CE.LIQ.FLOW.062_Tidydata.xlsx"), sheet = 1)
# We want to see whether the data for the experimental conditions have been imported correctly.
offspring %>% select(c("expType", "RawData", "compName", "compConcentration"))
## # A tibble: 360 x 4
## expType RawData compName compConcentration
## <chr> <dbl> <chr> <chr>
## 1 experiment 44 2,6-diisopropylnaphthalene 4.99
## 2 experiment 37 2,6-diisopropylnaphthalene 4.99
## 3 experiment 45 2,6-diisopropylnaphthalene 4.99
## 4 experiment 47 2,6-diisopropylnaphthalene 4.99
## 5 experiment 41 2,6-diisopropylnaphthalene 4.99
## 6 experiment 35 2,6-diisopropylnaphthalene 4.99
## 7 experiment 41 2,6-diisopropylnaphthalene 4.99
## 8 experiment 36 2,6-diisopropylnaphthalene 4.99
## 9 experiment 40 2,6-diisopropylnaphthalene 4.99
## 10 experiment 38 2,6-diisopropylnaphthalene 4.99
## # ... with 350 more rows
# As we can see, RawData should be an integer, compName and expType should be factors,
# and compConcentration should be a double. Let's change that.
offspring$RawData <- as.integer(offspring$RawData)
offspring$compName <- as.factor(offspring$compName)
offspring$expType <- as.factor(offspring$expType)
offspring_tidy <- offspring
offspring_tidy$compConcentration <- as.numeric(offspring_tidy$compConcentration)
# One of the values in compConcentration was stored as text in Excel (with a decimal comma),
# so as.numeric() turned it into NA. We recover this value from the original column.
character_placement <- which(is.na(offspring_tidy$compConcentration))
character_value <- offspring$compConcentration[character_placement] %>% str_replace(",", ".") %>% parse_number()
offspring_tidy$compConcentration[character_placement] <- character_value
# Let's check one last time whether the data types are correct (in offspring_tidy, the cleaned copy).
offspring_tidy %>% select(c("RawData", "compName", "compConcentration"))
## # A tibble: 360 x 3
## RawData compName compConcentration
## <int> <fct> <dbl>
## 1 44 2,6-diisopropylnaphthalene 4.99
## 2 37 2,6-diisopropylnaphthalene 4.99
## 3 45 2,6-diisopropylnaphthalene 4.99
## 4 47 2,6-diisopropylnaphthalene 4.99
## 5 41 2,6-diisopropylnaphthalene 4.99
## 6 35 2,6-diisopropylnaphthalene 4.99
## 7 41 2,6-diisopropylnaphthalene 4.99
## 8 36 2,6-diisopropylnaphthalene 4.99
## 9 40 2,6-diisopropylnaphthalene 4.99
## 10 38 2,6-diisopropylnaphthalene 4.99
## # ... with 350 more rows
# The column types in offspring_tidy are now correct, so we can use it for further analysis.
offspring_tidy %>%
  # a small constant is added so that a concentration of 0 does not give log10(0) = -Inf
  ggplot(aes(x = log10(compConcentration + 0.0001), y = RawData)) +
  geom_jitter(aes(shape = expType, colour = compName), width = .1) +
  labs(title = "Number of offspring from C. elegans incubated in different substances",
       subtitle = "Experiment data from (van der Voet et al. 2021)",
       x = "Log 10 of compound concentration",
       y = "Number of offspring per C. elegans",
       colour = "Compound name",
       shape = "Experiment type") +
  scale_shape_discrete(labels = c("Negative control", "Positive control", "Vehicle A control", "Experiment")) +
  scale_colour_brewer(palette = "Dark2") +
  theme_classic()
The positive control of this experiment is ethanol and the negative control has no added substance (S-medium only).
To analyze this experiment I would follow these steps:
1. Make a new column that shows which condition every worm belongs to (for example, group 1 would consist of 2,6-diisopropylnaphthalene at a concentration of 4.99 nM, etc.).
2. Check normality for every condition.
NORMALLY DISTRIBUTED DATA:
3. Perform an ANOVA with post-hoc tests and check whether the conditions differ from the control.
NOT NORMALLY DISTRIBUTED DATA:
3. Perform a Kruskal-Wallis test.
4. To visualize the differences, make a smoothed line graph of the mean at every concentration per substance.
5. Compare these graphs with each other.
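Steps 1 to 3 of this plan could look roughly like the sketch below. It is only an illustration, not the analysis from the paper: the `toy` data frame, the compound name `compound_A` and the 0.05 cut-off are assumptions, and only base R functions (`shapiro.test()`, `aov()`, `TukeyHSD()`, `kruskal.test()`) are used.

```r
# Hypothetical toy data: one compound at 3 concentrations, 10 worms each.
set.seed(1)
toy <- data.frame(
  compName = rep("compound_A", 30),
  compConcentration = rep(c(0, 0.5, 5), each = 10),
  RawData = c(rpois(10, 40), rpois(10, 35), rpois(10, 20))
)

# Step 1: one label per condition (compound + concentration).
toy$condition <- factor(paste(toy$compName, toy$compConcentration, sep = "_"))

# Step 2: Shapiro-Wilk normality test for every condition.
shapiro_p <- tapply(toy$RawData, toy$condition,
                    function(x) shapiro.test(x)$p.value)
all_normal <- all(shapiro_p > 0.05)  # assumed significance cut-off

# Step 3: choose the test based on the normality check.
if (all_normal) {
  fit <- aov(RawData ~ condition, data = toy)
  print(summary(fit))
  print(TukeyHSD(fit))  # post-hoc test: all pairwise comparisons
} else {
  print(kruskal.test(RawData ~ condition, data = toy))
}
```

Note that `TukeyHSD()` compares every pair of conditions; the comparisons against the control are a subset of its output. A test that compares each condition only with the control (for example Dunnett's test) would need an extra package.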
# Mean number of offspring in the negative control (S-medium).
normalized_value <- offspring_tidy %>%
  filter(compName == "S-medium") %>%
  summarise(mean = mean(RawData, na.rm = TRUE))
# Express every count relative to the mean of the negative control.
offspring_tidy <- offspring_tidy %>%
  mutate(normalized_offspring = RawData / normalized_value$mean)
offspring_tidy %>%
  ggplot(aes(x = log10(compConcentration + 0.0001), y = normalized_offspring)) +
  geom_jitter(aes(shape = expType, colour = compName), width = .1) +
  labs(title = "Amount of offspring from C. elegans incubated in different substances",
       subtitle = "Experiment data from (van der Voet et al. 2021)",
       x = "Log 10 of compound concentration",
       y = "Normalized offspring amount by mean of negative control",
       colour = "Compound name",
       shape = "Experiment type") +
  scale_shape_discrete(labels = c("Negative control", "Positive control", "Vehicle A control", "Experiment")) +
  scale_colour_brewer(palette = "Dark2") +
  theme_classic()
We normalize the data so we can see the difference between the different substances more easily.
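To make the normalization concrete, here is a small worked example with made-up numbers (the `toy` data frame and `compound_A` are hypothetical, not values from the paper). Dividing by the mean of the negative control puts the control around 1, so a value of 0.5 reads as "half as many offspring as the control".

```r
# Hypothetical toy data, not the values from the paper.
toy <- data.frame(
  compName = c("S-medium", "S-medium", "compound_A", "compound_A"),
  RawData  = c(38, 42, 20, 30)
)

# Mean of the negative control (S-medium): (38 + 42) / 2 = 40.
control_mean <- mean(toy$RawData[toy$compName == "S-medium"], na.rm = TRUE)

# Every count relative to the control mean: 0.95, 1.05, 0.50, 0.75.
toy$normalized_offspring <- toy$RawData / control_mean
print(toy)
```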
3.2 Checking reproducibility of published papers
In this assignment, the study by (Strobl et al. 2020) will be graded on the criteria for reproducibility,
and the study by (Brewer, Robey, and Unsworth 2021) will be graded on code readability and reproducibility.
3.2.1 Pesticide influence on consumption rate and survival for bees.
- Introduction of the paper
The use of pesticides is one of the main causes of biodiversity loss, and combining multiple pesticides could make this even worse. This experiment investigates the sublethal (food consumption) and lethal (survival) effects of pesticides on adult female solitary bees, Osmia bicornis.
To perform these tests, female solitary bees were divided into 4 groups:
– pesticide free (control)
– herbicide
– pesticide
– combined (both herbicide and pesticide)
Their consumption rate and longevity were measured, and the data from these two variables were used for the analysis.
There is no significant difference in survival or consumption between the different groups. There is, however, a significant positive correlation between the consumption rate and the longevity of these bees.
- Transparency criteria grading
| transparency criteria | grading |
|---|---|
| study purpose | TRUE |
| data availability | FALSE, only part of the data is available |
| data location | at the beginning / at the end |
| study location | TRUE, materials/methods |
| author review | location and email are present at the top |
| ethics statement | FALSE |
| funding statement | TRUE |
| code availability | TRUE |
The part of the data that is available can be found in this repository under the path “data/insects-957898-supplementary.xlsx”.
3.2.2 Impact of analysis decisions for episodic memory and retrieval practices
We will focus solely on the code of this paper to see:
– If the code can be understood easily.
– If I can reproduce one of the figures.
– If there are any bugs/flaws in the code.
The code is available on this website.
The code has been copied to a new Rmd file in this repository under the name “_analysis_decisions_code.Rmd”.
The data has been downloaded and is available in this repository under the name “data/AllDataRR.csv”.
- Changes made:
– Changed the directory in line 11 so that it retrieves the data used in this study.
– Installed the packages in line 19 and line 180.
- First impression:
– (+) Every test is in its own chunk, which makes the code easier to read.
– (+) Clear comments on what is happening.
– (+) Easy-to-understand code.
– (-) Chunks don't have names.
– (-) The individual results are far away from each other.
– (-) The same tests are performed multiple times; wrapping them in a function would make mistakes less likely.
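The last point could be addressed along the lines of the sketch below. This is not the code from the paper: `run_cor_test()` is a hypothetical helper, the `toy` data frame is simulated, and a correlation test is assumed as the repeated test only because the code is described as looking for correlations.

```r
# Hypothetical helper: run one correlation test and return the result as one row.
run_cor_test <- function(data, x, y) {
  res <- cor.test(data[[x]], data[[y]])
  data.frame(x = x, y = y,
             r = unname(res$estimate),
             p_value = res$p.value)
}

# Simulated toy data standing in for the study's variables.
set.seed(42)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.5)
toy$c <- rnorm(50)

# The same test for several variable pairs, written only once.
pairs <- list(c("a", "b"), c("a", "c"), c("b", "c"))
results <- do.call(rbind,
                   lapply(pairs, function(p) run_cor_test(toy, p[1], p[2])))
print(results)
```

Because the test is defined in one place, a change (for example a different method) only has to be made once, and all results end up together in one table instead of being spread over the document.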
- What this code is trying to achieve
The first part of the code (lines 24-174) looks for the correlation between the individual and the different studies.
The second part of the code looks at the correlation between the retrieval practice effect and the EM ability with the help of a graph. There are 2 graphs: one where everything is mean centered and one where it isn't.
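As a reminder of what mean centering does, here is a minimal sketch with made-up scores (the vector `em_ability` is hypothetical, not data from the study). The mean is subtracted from every value, so the centered variable has a mean of 0 while the distances between the values stay the same.

```r
# Hypothetical scores, not data from the study.
em_ability <- c(2, 4, 6, 8)

# Mean centering: subtract the mean (here 5) from every value.
em_centered <- em_ability - mean(em_ability)
print(em_centered)  # values: -3, -1, 1, 3
```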
- Final judgement (grades go from 1 to 5; 1 = very hard/bad, 5 = very easy/good):
– readability = 4
– reproducibility = 5
– efficiency = 2
3.3 Organisation of my files
Storing files in a consistent, predictable way makes it easier for your colleagues or complete strangers to find what they are looking for.
Below you can see the file structure from one of my previous projects. As you can see, all the files are stored in a similar way and can be found easily.